Dataset Details

The dataset consists of feature vectors belonging to 12,330 sessions. It was formed so that each session belongs to a different user within a 1-year period, to avoid any tendency toward a specific campaign, special day, user profile, or period.

The dataset consists of 10 numerical and 8 categorical attributes. The 'Revenue' attribute can be used as the Target Variable.

The values of these features are derived from the URL information of the pages visited by the user and are updated in real time when the user takes an action.

Administrative, Informational, Product Related - The number of pages of each type visited by the visitor in that session.

Administrative Duration, Informational Duration, Product Related Duration - Total time spent in each of these page categories.

Metrics measured by Google Analytics for each page of the e-commerce site

Bounce Rate - The value of the "Bounce Rate" feature for a web page is the percentage of visitors who enter the site from that page and then leave ("bounce") without triggering any other requests to the analytics server during that session.

Exit Rate - The value of the "Exit Rate" feature for a specific web page is the percentage of all pageviews of that page that were the last in the session.
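The two definitions above are easy to confuse, so here is a small worked example. It is only a sketch: the `sessions` list and the page names are made up for illustration and are not part of the dataset, which already ships these metrics as computed by Google Analytics.

```python
from collections import Counter

# Hypothetical sessions: each one is the ordered list of pages viewed.
sessions = [
    ["home"],                     # single-page session: a bounce on "home"
    ["home", "product", "cart"],
    ["product"],                  # single-page session: a bounce on "product"
    ["home", "product"],
    ["product", "home"],
]

entrances = Counter()  # sessions that started on the page
bounces = Counter()    # single-page sessions that started on the page
pageviews = Counter()  # all views of the page
exits = Counter()      # views that were the last one in their session

for pages in sessions:
    entrances[pages[0]] += 1
    if len(pages) == 1:
        bounces[pages[0]] += 1
    exits[pages[-1]] += 1
    pageviews.update(pages)

# Bounce rate is relative to entrances; exit rate is relative to all pageviews.
bounce_rate = {page: bounces[page] / entrances[page] for page in entrances}
exit_rate = {page: exits[page] / pageviews[page] for page in pageviews}

print(bounce_rate)
print(exit_rate)
```

Note the different denominators: a page that is rarely an entry point can have a high exit rate while its bounce rate is based on only a handful of sessions.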

Page Value - The "Page Value" feature represents the average value for a web page that a user visited before completing an e-commerce transaction.

Special Day - The "Special Day" feature indicates the closeness of the site visiting time to a specific special day (e.g. Mother’s Day, Valentine's Day) on which sessions are more likely to be finalized with a transaction.

The dataset also includes operating system, browser, region, traffic type, visitor type (returning or new visitor), a Boolean value indicating whether the date of the visit is a weekend, and the month of the year.

Importing necessary libraries

Importing the Dataset

Dataset information

Null values in the Dataset

Visualization

Univariate Analysis

Bi-Variate Analysis

Outlier Detection

There are outliers; we have already seen them in the box plots.
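The notes do not say which rule was used to flag outliers. A minimal sketch, assuming the usual 1.5 x IQR rule that also determines the whiskers of a box plot, could look like this. The helper `iqr_outlier_mask` and the sample values are hypothetical.

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a Boolean mask that is True where a value lies outside
    [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Made-up durations standing in for one numerical column of the dataset.
durations = pd.Series([0.0, 10.0, 12.0, 15.0, 18.0, 20.0, 25.0, 400.0])

mask = iqr_outlier_mask(durations)
n_outliers = int(mask.sum())
print(n_outliers, durations[mask].tolist())
```

Applied column by column, the mask gives a count of flagged values per feature, which can be compared against what the box plots show.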

Multicollinearity

Type Casting

Categorical Column Treatment

Multicollinearity

Checking for Class Imbalance

Performing relevant tests to find out whether the independent variables are significant to the target variable

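The numerical columns are tested with a two-sample t-test, and the categorical columns are tested with a chi-square test.

The assumptions of the two-sample t-test are checked with the Shapiro test and the Levene test.

Shapiro test: H0: the sample is drawn from a normal distribution. Ha: the sample is not drawn from a normal distribution.

Levene test: H0: the group variances are equal. Ha: the group variances are not equal.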
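The testing sequence described above can be sketched with `scipy.stats`. The data here are synthetic stand-ins (the arrays `revenue_yes`, `revenue_no` and the series `visitor_type`, `revenue` are invented for illustration), and the 0.05 threshold is an assumption, not something the notes state.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic values of one numerical feature, split by the target class.
revenue_yes = rng.normal(loc=5.0, scale=1.0, size=200)
revenue_no = rng.normal(loc=0.0, scale=1.0, size=200)

# Assumption checks: normality within each group, then equality of variances.
_, p_normal_yes = stats.shapiro(revenue_yes)
_, p_normal_no = stats.shapiro(revenue_no)
_, p_equal_var = stats.levene(revenue_yes, revenue_no)

# Two-sample t-test; fall back to Welch's variant if Levene rejects equal variances.
equal_var = bool(p_equal_var > 0.05)
t_stat, p_ttest = stats.ttest_ind(revenue_yes, revenue_no, equal_var=equal_var)

# Chi-square test of independence between a categorical feature and the target.
visitor_type = pd.Series(["Returning"] * 60 + ["New"] * 40)
revenue = pd.Series([True] * 10 + [False] * 50 + [True] * 20 + [False] * 20)
table = pd.crosstab(visitor_type, revenue)
chi2, p_chi2, dof, expected = stats.chi2_contingency(table)

print(f"t-test p={p_ttest:.3g}, chi-square p={p_chi2:.3g}, dof={dof}")
```

If the Shapiro test rejects normality for a feature, a rank-based alternative such as `stats.mannwhitneyu` is a common substitute for the t-test.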

Train Test Split and Scaling

Checking whether the train and test sets are representative of the overall data
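One way to make this check concrete is to compare the class proportions of the target in the full data, the training set, and the test set. The sketch below uses synthetic data; the 70/30 split, the stratification, and the feature names are assumptions rather than settings taken from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic features and an imbalanced binary target standing in for 'Revenue'.
X = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["f1", "f2", "f3"])
y = pd.Series(np.r_[np.ones(150, dtype=int), np.zeros(850, dtype=int)], name="Revenue")

# stratify=y keeps the class ratio (almost) identical in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Side-by-side class shares: similar columns suggest a representative split.
shares = pd.DataFrame({
    "overall": y.value_counts(normalize=True),
    "train": y_train.value_counts(normalize=True),
    "test": y_test.value_counts(normalize=True),
})
print(shares)
```

The same comparison can be extended to the features, for example by comparing `X_train.describe()` with `X_test.describe()`.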

Scaling

Base Model Building

Evaluating the base models

Building Base model with class imbalance treatment

Tuning the base model

Decision Tree TUNING

After tuning the Decision Tree, the bias score increases and the variance error decreases.
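The notes do not define "bias score" or "variance error". A minimal sketch, assuming they refer to the mean and the standard deviation of the cross-validated score, and assuming a grid search over tree depth and leaf size, might look like this. The data, the parameter grid, and the helper `cv_summary` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the prepared training set.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=4,
    weights=[0.85, 0.15], random_state=0,
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

def cv_summary(model):
    """Mean and sample standard deviation of the cross-validated ROC AUC."""
    scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    return scores.mean(), scores.std(ddof=1)

base_mean, base_std = cv_summary(DecisionTreeClassifier(random_state=0))

# Hypothetical search space; the notebook's actual grid is not shown.
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 3, 5, 8], "min_samples_leaf": [1, 5, 20]},
    scoring="roc_auc",
    cv=cv,
)
grid.fit(X, y)
tuned_mean, tuned_std = cv_summary(grid.best_estimator_)

print(f"base : mean={base_mean:.3f} std={base_std:.3f}")
print(f"tuned: mean={tuned_mean:.3f} std={tuned_std:.3f} params={grid.best_params_}")
```

Under this reading, a higher mean with a lower standard deviation after tuning is what the note describes. The same pattern applies to KNN and Random Forest with their own parameter grids.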

KNN TUNING

After tuning KNN, the bias score increases and the variance error decreases.

After tuning the Random Forest, the bias score increases and the variance error decreases.

The backward-selected features give the best result when building the Logistic Regression model.
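The notes do not show how backward selection was carried out (it may have been p-value based or done with another library). One possible sketch uses scikit-learn's `SequentialFeatureSelector` with `direction="backward"`. The synthetic data, the number of features to keep, and the scoring metric are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic data: 8 features, of which only 3 carry signal.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

# Start from all features and drop, one at a time, the feature whose removal
# hurts the cross-validated score the least, until 4 features remain.
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=4,
    direction="backward",
    scoring="roc_auc",
    cv=5,
)
selector.fit(X, y)

selected = selector.get_support(indices=True)  # column indices that were kept
X_selected = selector.transform(X)
print(selected, X_selected.shape)
```

The reduced matrix `X_selected` can then be used to fit the final Logistic Regression model and compare it against the model trained on all features.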

Results after Tuning

  1. The performance of every model improves markedly after tuning.
  2. To increase the bias score even further, we apply boosting.

Boosting

Stacking
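As an illustration of this step, the sketch below stacks the three tuned model families mentioned earlier (Decision Tree, KNN, Random Forest) under a Logistic Regression meta-learner using scikit-learn's `StackingClassifier`. The choice of base learners, their hyperparameters, and the synthetic data are assumptions; the notebook's actual configuration is not shown.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the prepared features and target.
X, y = make_classification(n_samples=500, n_features=10, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Base learners produce out-of-fold predictions (cv=5), which become the
# input features of the final Logistic Regression.
stack = StackingClassifier(
    estimators=[
        ("dt", DecisionTreeClassifier(max_depth=5, random_state=0)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_train, y_train)

test_accuracy = stack.score(X_test, y_test)
print(f"stacked test accuracy: {test_accuracy:.3f}")
```

On imbalanced data such as this dataset, accuracy alone can be misleading, so metrics like ROC AUC or F1 are usually reported alongside it.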